Bayesian Adaptive Lasso 



o 

(N 



Or 



On 
O 
O 



>< 



Chenlei Leng, Minh Ngoc Tran and David Nott 
September 14, 2010 



Abstract 



We propose the Bayesian adaptive Lasso (BaLasso) for variable selection and coefficient 
ij^ '■ estimation in linear regression. The BaLasso is adaptive to the signal level by adopting 

' different shrinkage for different coefficients. Furthermore, we provide a model selection 

machinery for the BaLasso by assessing the posterior conditional mode estimates, motivated 
by the hierarchical Bayesian interpretation of the Lasso. Our formulation also permits 
prediction using a model averaging strategy. We discuss other variants of this new approach 
and provide a unified framework for variable selection using flexible penalties. Empirical 
' evidence of the attractiveness of the method is demonstrated via extensive simulation studies 

and data analysis. 
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1 Introduction 

Consider the linear regression problem 
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where y is an nx 1 vector of responses, X is an nxp matrix of covariates and e is an nx 1 vector 
of iid normal errors with mean zero and variance a 2 . As is usual in regression analysis, our 
major interests are to estimate /3 = (/3i,...,/L)', to identify its important covariates and to make 
accurate predictions. Without loss of generality, we assume y and X are centered so that fi is 
zero and can be omitted from the model. 



In an important paper, Tibshirani (1996) proposed the least absolute shrinkage and selec- 
tion operator (Lasso) for simultaneous variable selection and parameter estimation. The Lasso, 
formulated in the penalized likelihood framework, minimizes the residual sum of squares with a 
constraint on the l\ norm of j3. Formally, the Lasso solves 

p 

min(y-^) T (y-^) + A^|^|, (1) 

3=1 

where A > is the tuning parameter controlling the amount of penalty. The least angle regression 
(LARS) algorithm provides fast implementation of the Lasso solution (Efron et al., 2004; Osborne 
et al., 2000). Furthermore, the Lasso can be model selection consistent provided that the so- 
called irrepresentable condition on the design matrix is satisfied and that A is chosen judiciously 
(Zhao and Yu, 2006). 

However, if this condition does not hold, Zou (2006) and Zhao and Yu (2006) showed that the 
Lasso chooses the wrong model with non-vanishing probability, regardless of the sample size and 
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how A is chosen. The condition is almost necessary and sufficient for model selection consistency 
of Lasso, which requires that the predictors not in the model are not representable by predictors 
in the true model. This condition can be easily violated due to the collinearity between the 
predictors. To address this issue, Zou (2006) and Wang et al. (2006) proposed to use adaptive 
Lasso (aLasso) which gives consistent model selection. The final inference procedure, thereafter, 
is based on a single selected model. This may bring undesirable risk properties as discussed 
by Pdtscher and Leeb (2009). Meinshausen and Buhlmann (2009) introduced sub-sampling in 
model selection that improves the Lasso. 

The Lasso estimator can be interpreted as the posterior mode using normal likelihood and iid 
Laplace prior for (3 (Tibshirani, 1996). Yuan and Lin (2006) studied an empirical Bayes variable 
selection method targeting at finding this mode. The first explicit treatment of the Bayesian 
Lasso (BLasso), which exploits model inference via posterior distributions, has been proposed by 
Park and Casella (2008). Hans (2010) considers a formal Bayesian approach to exploring model 
uncertainty with lasso type priors on parameters in submodels. Griffin and Brown (2010) have 
previously considered generalizing the Bayesian lasso in various ways including the use of separate 
scale parameters for different coefficients in the Laplace prior with gamma mixing distributions 
for the scale parameters. This is similar to the priors we use here, but Griffin and Brown (2010) 
focused on finding posterior mode estimates via an EM algorithm whereas our objectives here are 
somewhat broader. In particular we aim to investigate MCMC computational methods for these 
priors, estimates of regression coefficients other than the mode, different choices for smoothing 
parameters, model averaging strategies which explore model uncertainty for predictive purposes 
and generalizations beyond the linear model. 

Although the Lasso was originally designed for variable selection, the BLasso loses this at- 
tractive property, not setting any of the coefficients to zero. A post hoc thresholding rule may 
overcome this difficulty but it brings the problem of threshold selection. Alternatively, Kyung 
et al. (2009) recommended to use the credible interval on the posterior mean. Although it gives 
variable selection, this suggestion fails to explore the uncertainty in the model space. On the 
other hand, the so-called spike and slab prior, in which the scale parameter for a coefficient 
is a mixture of a point mass at zero and a proper density function such as normal or double 
exponential (Yuan and Lin, 2005), allows exploration of model space at the expense of increased 
computation for a full Bayesian posterior. 

This work is motivated by the need to explore model uncertainty and to achieve parsimony. 
With these objectives, we consider the following adaptive Lasso estimator: 

mm (y -X0) T (y-X f3) + J2*j\Pjl (2) 

where different penalty parameters are used for the regression coefficients. Naturally, for the 
unimportant covariates, we should put larger penalty parameters Xj on their corresponding 
coefficients. This strategy was proposed by Zou (2006) and Wang et al. (2006) by using some 
preliminary estimates of fi such as the least-squares estimate /3° and modifying Xj as X/\$j\. 
Our treatment is completely different and is motivated by the following arguments. Suppose 
tentatively that we have a posterior distribution on {Xj}^ =1 . By drawing random samples from 
this distribution and plugging these into (2), we can solve for (3 using fast algorithms developed 
for Lasso (Efron et al., 2004; Figueiredo et al., 2007) and subsequently obtain an array of 
(sparse) models. These models can be used not only for exploring model uncertainty, but also 
for prediction with a variety of methods akin to Bayesian model averaging. Since there are p 
tuning parameters, a hierarchical model is proposed to alleviate the problem of estimating many 
parameters. We develop an efficient Gibbs sampler for posterior inference. 

The BaLasso permits a unified treatment for variable selection with flexible penalties, using 
the least sqaures approximation (Wang and leng, 2007). The extension encompasses generalized 
linear models, Cox's model and other parametric models as special cases. We outline novel 
applications of BaLasso when structured penalties are present, for example, grouped variable 
selection (Yuan and Lin, 2007) and variable selection with a prior hierarchical structure (Zhao, 
Rocha and Yu, 2009). 
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The rest of the paper is organized as follows. The Bayesian adaptive Lasso (BaLasso) method 
is presented in Section 2. Furthermore, we propose two approaches for estimating the tuning 
parameter vector A= (Ai,...,A p )' and give an explanation for the shrinkage adaptivity. Sec- 
tion 3 discusses model selection and Bayesian model averaging. In Section 4, the finite sample 
performance of BaLasso is illustrated via simulation studies, and analysis of two real datasets. 
Section 5 presents a unified framework which deals with variable selection in models with struc- 
tured penalties. Section 6 gives concluding remarks. A Matlab implementation is available from 
the authors' homepage. The software is very general and deals with many parametric models 
encountered in practice. 



2 Bayesian Adaptive Lasso 

The i\ penalty corresponds to a conditional Laplace prior (Tibshirani, 1996) as 



A -AI&I/VS* 



which can be represented as a scale mixture of normals with an exponential mixing density 
(Andrews and Mallows, 1974) 

-e- A l-l= f°_^ e - 2 /(2 S )^! e -AV2 ds . 
2 J Q 2 

This motivates the following hierarchical BLasso model (Park and Casella, 2008) 

y\X,f5,a 2 ~ N n (X/3,a 2 I n ) 
P\<t 2 ,t 2 ,...,t 2 ~ N p (0 p ,a 2 D T ) (3) 
D T = diag(ri,...,7-p) 



with the following priors on a 2 and r= (rf ,...,t 2 ): 

a 2 ,r 2 , ...,r 2 ~ n(a 2 )da 2 f[ ^e~^/ 2 dr 2 (4) 

for a 2 >0 and t 2 ,...,t 2 >0. Park and Casella (2008) suggested to use the improper prior 7t(ct 2 )oc 
1/cr 2 to model the error variance. 

As discussed in the introduction, the Lasso uses the same shrinkage for every coefficient and 
may not be consistent for certain design matrices in terms of model selection. This motivates us 
to replace (4) in the hierarchical structure by a more adaptive penalty 

a 2 ,r 2 ,...,r 2 ~ n(a 2 )da 2 f[ ^ e -^/ 2 dr 2 . (5) 

The major difference of this formulation is to allow different \ 2 , one for each coefficient. In- 
tuitively, if small penalty is applied to those covariates that are important and large penalty 
is applied to those which are unimportant, the Lasso estimate, as the posterior mode, can be 
model selection consistent (Zou, 2006; Wang et al. 2007). Indeed, as we will see in Section 2.2 
and in later numerical experiments, in the posterior distribution, the Aj's for zero /3j's will be 
much larger than those Aj's for nonzero /3,-'s. 

The Gibbs sampling scheme follows Park and Casella (2008). For Bayesian inference, the full 
conditional distribution of j3 is multivariate normal with mean A~ 1 X T y and variance a 2 A -1 , 
where A=X T X+D~ 1 . The full conditional for a 2 is inverse-gamma with shape parameter (n — 
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l)/2+p/2 and scale parameter (y-X /3) T (y-X (3)/2+/3 T D^ 1 j3/2 and rf,...,^ are conditionally 
independent, with 1/r? conditionally inverse-Gaussian with parameters 

where the inverse-Gaussian density is given by 

fix) = V^27rx- 3 / 2 exp| - H * { Z^ 2 1 ,x > 0. 

As observed in Park and Casella (2008), the Gibbs sampler with block updating of (5 and 
(rf ,...,Tp) is very fast. 



2.1 Choosing the Bayesian Adaptive Lasso Parameters 

We discuss two approaches for choosing BaLasso parameters in the Bayesian framework: the 
empirical Baycs (EB) method and the hierarchical Bayes (HB) approach using hyper-priors. The 
EB approach aims to estimate the Xj via marginal maximum likelihood, while the HB approach 
uses hypcrpriors on the Xj which enables posterior inference on these shrinkage parameters. 

Empirical Bayes (EB) Estimation. A natural choice is to estimate the hyper-parameters A,, 
by marginal maximum likelihood. However, in our framework, the marginal likelihood for the 
XjS is not available in closed form. To deal with such a problem, Casella (2001) proposed a multi- 
step approach based on an EM algorithm with the expectation in the E-step being approximated 
by the average from the Gibbs sampler. The updating rule then for Xj is easily seen to be 



where A^ is the estimate of Xj at the fcth stage and the expectation E x (k-i) (.) is approximated 

J (k— 1) 

by the average from the Gibbs sampler with the hyper-parameters are set to A^ 

Casella's method may be computationally expensive because many Gibbs sampler runs are 
needed. Atchade (2009) proposed a single-step approach based on stochastic approximation 
which can obtain the MLE of the hyper-parameters using a single Gibbs sampler run. In our 
framework, making the transformation Xj = e Sj , the updating rule for the hyper-parameters Sj 
can be seen as (Atchade 2009, Algorithm 3.1) 

where is the value of Sj at the nth iteration, t%j is the nth Gibbs sample of r?, and {a n } 
is a sequence of step-sizes such that 

In the following simulation, a n is set to 1/n. Strictly speaking, choosing a proper a n is an 
important problem of stochastic approximation which is beyond the scope of this paper. In 
practice, a n is often set after a few trials by justifying the convergence of iterations graphically. 

Hierarchical Model. Alternatively, A^s themselves can be treated as random variables and 
join the Gibbs updating by using an appropriate prior on A^. Here for simplicity and numerical 



tractability, we take the following gamma prior (Park and Casella, 2008) 



(A^-V 4 *?. (7) 
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Figure 1: (a)-(b): Gibbs samples for Ai and A2, respectively, (c)-(d): Trace plot for Ai and 
X^ by Atchade's method. 

The advantage of using such a prior is that the Gibbs sampling algorithm can be easily imple- 
mented. More specifically, when this prior is used, the full conditional of A^ is gamma with shape 
parameter l+r and rate parameter tJ+S. This specification allows A^ to join the other param- 
eters in the Gibbs sampler. Although the number of the penalty parameters A^ has increased 
to p in BaLasso from a single parameter in Lasso, the fact that the same prior is used on these 
parameters greatly reduces the degrees of freedom in specifying the prior. 

As a first choice, we can fix hyper-parameters r and S to some small values in order to get a 
flat prior. Alternatively, we can fix r and use an empirical Bayes approach where S is estimated. 
The updating rule for 5 (Casella, 2001) can be seen as 

5 {k) = 

Theoretically, we need not worry so much about how to select r because parameters that are 
deeper in the hierarchy have less effect on inference (Lehmann, 1998, p. 260). In our simulation 
study and data analysis, we use r=.l which gives a fairly flat prior and stable results. 

2.2 Adaptive shrinkage 

By allowing different A|, adaptive shrinkage on the coefficients is possible. We demonstrate the 
adaptivity by a simple simulation in which a data set of size 50 is generated from the model 

with P=(3, 0)', cr=l, e~AT(0,l). 

Because /3i^0, (3 2 =0 we expect that the EB and posterior estimate of X 2 would be much 
larger than that of Ai. As a result, a heavier penalty is put on j3 2 such that (3 2 is more likely to 
be shrunken to zero. This phenomenon is demonstrated graphically in Figure 1. Figure 1 (a)-(b) 
plot 10,000 Gibbs samples (after discarding 10,000 burn-in samples) for Ai and A2 (not A^, X 2 ), 
respectively. The posterior distribution of A2 is central around a value of 22 which is much larger 
than .39, the posterior median of Ai. Figure 1 (c)-(d) shows the trace plots of iterations A^ , 
X^ from Atchade's method. Marginal maximum likelihood estimates of Ai and A2 are 0.39 and 
19, respectively. In Figure 2 we plot EB and posterior mean estimates of A2 versus (3 2 when 
(3 2 varies from to 5. Clearly, both the EB and the posterior estimates of A2 decrease as (3 2 
increases, which demonstrates that lighter penalty is applied for stronger signals. 
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Figure 2: Plots of EB and posterior estimates of A2 versus /?2 

3 Inference 

3.1 Estimation and Model Selection 

For the adaptive Lasso, the usual methods to choose the A/s would be computationally de- 
manding. From the Bayesian perspective, one can draw MCMC samples based on BaLasso and 
get an estimated posterior quantity for j3. Like the original Bayesian Lasso, however, a full 
posterior exploration gives no sparse models and would fail as a model selection method. Here 
we take a hybrid Bayesian-frcquentist point of view in which coefficient estimation and variable 
selection are simultaneously conducted by plugging in an estimate of A into (2), where A might 
be the marginal maximum likelihood estimator, posterior median or posterior mean. Hereafter 
these suggested strategies are abbreviated as BaLasso-EB, BaLasso-Median, and BaLasso-Mean, 
respectively. 

With the presence of a posterior sample, we also propose another strategy for exploring 
model uncertainty. Let {A^}^! be Gibbs samples drawn from the hierarchical model (3), (5) 
and (7). For the sth Gibbs sample A^ s ^ = (A^ ,...,Ap S ^)', we plug A^ s ^ into (2) and then record 
the frequencies of each variable being chosen out of N samples. The final chosen model consists 
of those variables whose frequencies are not less than 0.5. This strategy will be abbreviated as 
BaLasso-Freq. The chosen model is somewhat similar in spirit to the so-called median probability 
(MP) model proposed by Barbieri and Berger (2004). 

As we will see in Section 4, all of our proposed strategies have surprising improvement in 
terms of variable selection over the original Lasso and the adaptive Lasso. 

3.2 A Model Averaging Strategy 

When model uncertainty is present, making inferences based on a single model may be dangerous. 
Using a set of models helps to account for this uncertainty and can provide improved inference. In 
the Bayesian framework, Bayesian model averaging (BMA) is widely used for prediction. BMA 
generally provides better predictive performance than a single chosen model, see Raftery et al. 
(f 997); Hoeting et al. (1999) and references therein. For making inference via multiple models, we 
use the hierarchical model approach for estimating A and refer to the strategy outlined below as 
BaLasso-BMA. It should be emphasized, however, that our model averaging strategy is unrelated 
to the usual formal Bayesian treatment of model uncertainty. Rather, our idea is simply to use 
an ensemble of sparse models for prediction obtained from sampling the posterior distribution of 
smoothing parameters and considering different sparse conditional mode estimates of regression 



Posterior mean 

EB estimate 
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coefficients for the smoothing parameters so obtained. 

Let A — (xa,Va) be a future observation and D— (X,y) be the past data. The posterior 
predictive distribution of A is given by 

p(A|£>) = J P (A\(3)p(f3\X,D)d(3p(X\D)dX. (8) 

Suppose that we measure predictive performance via a logarithmic scoring rule (Good, 1952), 
i.e., if g(A\D) is some distribution we use for prediction then our predictive performance is 
measured by logg(A|D) (where larger is better). Then for any fixed smoothing parameter vector 
A 

£(logp(A|D)-log P (A|A 0) £>)) = J log -g^L p(A |j>)dA 

is nonnegative because the right hand side is the Kullback-Leibler divergence betweenp(A|D) and 
p(A\\q,D). Hence prediction with p(A\D) is superior in this sense to prediction with p(A\Xo,D) 
with any choice of Xq . 

Our hierarchical model (3), (5) and (7) offers a natural way to estimate the predictive distri- 
bution (8), in which the integral is approximated by the average from Gibbs samples of A. For 
example, in the case of point prediction for y A with squared error loss, the ideal prediction is 

E(y A \D)= f x' A E(f3\X,D)p(X\D)dX=x A E(/3\D), 

where E(/3\D) can be estimated by the mean of Gibbs samples for j3. Write f}\ as the conditional 
posterior mode for (3 given A. One could approximate x' A E((3\D) by replacing E((3\D) with the 
conditional posterior mode /3; for some fixed value A of A. However, this ignores uncertainty in 
estimating the penalty parameters. An alternative strategy is to replace E((3\D,X) in the integral 
above with (3\ and to integrate it out accordingly. This should provide a better approximation 
to the full Bayes solution than the approach which uses a fixed A. In fact, we predict E{y A \D) 
by s ~ 1 J2i = i x Afi\( i i where A^, i = l,...,s, denote MCMC samples drawn from the posterior 
distribution of A. Note that this approach has advantages in interpretation over the fully Bayes' 
solution. By considering the models selected by the conditional posterior mode for different draws 
of A from p(X\y) we gain an ensemble of sparse models that can be used for interpretation. As 
will be seen in Section 4, when there is model uncertainty, BaLasso-BMA provides an ensemble 
of sparse models and may have better predictive performance than conditioning on a single fixed 
smoothing parameter vector A. 

4 Examples 

In this section we study the proposed methods through numerical examples. These methods are 
also compared to Lasso, aLasso and BLasso in terms of variable selection and predictions. We use 
the LARS algorithm of Efron et al. (2004) for Lasso and aLasso in which fivefold cross-validation 
is used to choose shrinkage parameters. In the adaptive Lasso, we either use the least squares 
estimate (Example 1 and 2) or the Lasso estimate (Example 3) as the preliminary estimate. For 
the optimization problem (2), we use the gradient projection algorithm developed by Figueiredo 
et al. (2007). 

4.1 Simulation 

Example 1 (Simple example). We simulate data sets from the model 

y = x'/3+ae, (9) 

where /3=(3, 1.5, 0, 0, 2, 0, 0, 0)', Xj follows N(0,1) marginally and the correlation between Xj 
and Xk is 0.5'- J_ ' c ', and e is iid N(0,1). We compare the performance of the proposed methods 
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30 
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:plications for Example 1. 


n 
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Lasso 


aLasso 


BaLasso-Frcq 


BaLasso-Median 


BaLasso-Mcan 


BaLasso-EB 


60 


9 





5 


8 


8 


9 


12 


120 


5 


10 


45 


66 


65 


66 


51 


300 


3 


12 


65 


83 


83 


85 


83 


300 


1 


12 


100 


100 


100 


100 


100 



Table 2: Frequency of correctly-fitted models over 100 replications for Example 2. 



in Section 3.1 to that of the original Lasso and adaptive Lasso. The performance is measured 
by the frequency of correctly-fitted models over 100 replications. The simulation results are 
summarized in Table 1 and suggest that the proposed methods perform better than Lasso and 
aLasso in model selection. 

Example 2 (Difficult example). For the second example, we use Example 1 in Zou (2006), 
for which the Lasso does not give consistent model selection, regardless of the sample size and 
how the tuning parameter A is chosen. Here /3 = (5.6, 5.6, 5.6, 0)' and the correlation matrix of 
x is such that cov(xj,Xk) = — -39, j<fc<4 and cor(a:j,X4) = .23, j<4. 

The experimental results are summarized in Table 2 in which the frequencies of correct 
selection are shown. We see that the original Lasso does not seem to give consistent model 
selection. For all the other methods, the frequencies of correct selection go to 1 as n increases 
and a decreases. In general, our proposed method for model selection performs better than 
aLasso. 

Example 3 (Large p example) . The variable selection problem with large p (even larger than 
n) is recently an active research area. We consider an example of this kind in which p = 100 
with various sample sizes n = 50, 100, 200. We set up a sparse recovery problem in which most of 
coefficients are zero except f3j = 5, j = 10, 20, ...,100. From the previous examples, the performances 
of the four methods BaLasso-Freq, BaLasso-Median, BaLaso-Mean and BaLasso-EB are similar. 
We therefore just consider the BaLasso-Mean as a representative and compare it to the adaptive 
Lasso which is generally superior to the Lasso. 

Table 3 summarizes our simulation results, in which the design matrix is simulated as in 
Example 1. BaLasso-Mean performs satisfactorily in this example and outperforms aLasso in 
variable selection. 

Example 4 (Prediction). In this example, we examine the predictive ability of BaLasso-BMA 
experimentally. As discussed in Section 3.2, when there is model uncertainty, making predictions 
conditioning on a single fixed parameter vector is not optimal predictivcly. Suppose that the 
dataset D is split into two sets: a training set D T and prediction set D p . Let A= (xa-iVa) ED P 
be a future observation and ijA be a prediction of ua based on D T . We measure the predictive 
performance by the prediction squared error (PSE) 

PSE =TlU E l?M-yA| 2 . (10) 

1 1 AeD p 
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Table 3: Frequency of correctly-fitted models over 100 replications for Example 3. 



riT = np 


a 


Lasso 


aLasso 


BLasso 


BaLasso-Mean 


BaLasso-BMA 


30 


1 


2.029 


1.976 


1.276 


1.175 


1.165 




3 


17.43 


17.37 


10.88 


15.51 


11.06 




5 


42.74 


42.13 


29.43 


41.32 


29.56 




10 


126.6 


126.2 


109.6 


123.9 


109.9 


100 


1 


1.449 


1.436 


1.044 


1.077 


1.032 




3 


12.69 


12.58 


9.662 


9.627 


9.485 




5 


34.89 


34.79 


25.79 


27.55 


25.83 




10 


117.6 


117.5 


105.7 


118.2 


106.5 


200 


1 


1.279 


1.274 


1.018 


1.036 


1.014 




3 


11.44 


11.40 


9.424 


9.326 


9.320 




5 


31.30 


31.18 


25.32 


25.36 


25.19 




10 


120.7 


120.7 


103.9 


108.8 


104.3 



Table 4: Prediction squared error averaged over 100 replications for the small-p case. 

We compare PSE of BaLasso-BMA to that of BaLasso-Mean in which jja = x' A $ where (3 is 
the solution to (2) with smoothing parameter vector fixed at the posterior mean of A. We also 
compare the predictive performance of BaLasso-BMA to that of the Lasso, aLasso, and the 
original Bayesian Lasso (BLasso). The implementation of BLasso is similar to BaLasso except 
that BLasso has a single smoothing parameter. 

We first consider a small-p case in which data sets are generated from model (9) but now 
with /3 — (3, 1.5, 0.1, 0.1, 2, 0, 0, 0)'. By adding two small effects we expect there to be model 
uncertainty. Table 4 presents the prediction squared errors averaged over 100 replications with 
various factors nr (size of training set), np (size of prediction set) and a. The experiment shows 
that BaLasso-BMA performs slightly better than BLasso and BaLasso-Mean, and much better 
than the Lasso and aLasso. 

Similarly, we consider a large-p case as in Example 3 but now with /3io = P20 = Pso = Pao = 
/?50 = .5 in order to get model uncertainty. The results are summarized in Table 5. Unlike for the 
small-p case, BLasso now performs surprisingly badly. This may be due to the fact that BLasso 
uses the same shrinkage for every coefficient. As shown, BaLasso-BMA outperforms the others. 

4.2 Real Examples 

Example 5: Body fat data. Percentage of body fat is one important measure of health, which 
can be accurately estimated by underwater weighing techniques. These techniques often require 
special equipment and are sometimes not convenient, thus fitting percent body fat to simple 
body measurements is a convenient way to predict body fat. Johnson (1996) introduced a data 
set in which percent body fat and 13 simple body measurements (such as weight, height and 
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tit — np 




Lasso 


aLasso 


BLasso 


BaLasso-Mean 


[ ) n T _ _ _ _ T > TV If A 

r>a.Lasso-r>lVlA 


100 


1 


3.501 


4.173 


9.574 


1.673 


1.234 




3 


15.49 


17.70 


ri7 /to 

27.42 


1 A O O 

10.88 


10.42 




5 


34.45 


39.81 


42.43 


28.66 


28.19 




10 


149.3 


178.1 


161.0 


124.5 


117.6 


200 


1 


2.468 


2.417 


5.231 


1.110 


1.072 




3 


17.11 


17.09 


15.12 


10.42 


10.22 




5 


44.49 


44.39 


33.92 


27.18 


27.06 




10 


148.1 


147.5 


136.1 


112.0 


108.9 



Tabic 5: Prediction squared error averaged over 100 replications for the large-p case. 

abdomen circumference) are recorded for 252 men (see Table 6 for the summarized data). This 
data set was also carefully analyzed by Hoeting et al. (1999). Following Hoeting et al., we omit 
the 42nd observation which is considered as an outlier. Previous diagnostic checking (Hoeting 
et al., 1999) showed that it is reasonable to assume a linear regression model. 



Predictor number 


Predictor 


mean 


s.d. 


Y 


Percent body fat (%) 


18.89 


7.72 


X! 


Age (years) 


44.89 


12.63 


x 2 


Weight (pounds) 


178.82 


29.40 


x 3 


Height (inches) 


70.31 


2.61 


a 4 


Neck circumference (cm) 


37.99 


2.43 


a 5 


Chest circumference (cm) 


100.80 


8.44 


x 6 


Abdomen circumference (cm) 


92.51 


10.78 


x 7 


Hip circumference (cm) 


99.84 


7.11 


a 8 


Thigh circumference (cm) 


59.36 


5.21 


x 9 


Knee circumference (cm) 


38.57 


2.40 


X\o 


Ankle circumference (cm) 


23.10 


1.70 


X u 


Extended biceps circumference 


32.27 


3.02 


X\2 


Forearm circumference (cm) 


28.66 


2.02 


X\3 


Wrist circumference (cm) 


18.23 


.93 



Table 6: Body fat example: summarized data 



We first consider the variable selection problem. We center the variables so that the in- 
tercept is not considered. Lasso chooses X\, A 2 , A 3 , X4, A 6 , X-j, X$, An, X 12 , A i3 in 
the final model with a BIC value 712.16, while aLasso has one fewer variable A 3 with a 
BIC value 709.46. BaLasso-Freq, BaLasso-Median, BaLasso-Mean and BaLasso-EB all choose 
Ai, A 2 , X4, A 6 , A 8 , An, A12, A13, one fewer variable (A7) than aLasso. The BIC value for 
BaLasso is 708.92, smaller than that of Lasso and aLasso. A simple analysis shows that A3 
and A7 are highly correlated to A 6 (the correlation coefficients are .89 and .92, respectively). 
Additionally, Xq is the most important predictor (Hoeting et al., 1999). Thus removing A3 and 
A7 from the model helps to avoid the multicollinearity problem. To conclude, BaLasso chooses 
the simplest model with the smallest BIC. 

We now proceed to explore model uncertainty inherent in this dataset. Let M(A) be the 
model selected w.r.t. shrinkage parameter vector A. We define the posterior model probability 
(PMP) of a model M to be 

p(M\D) = f p(X\D)d\. 

J\:M(\) = M 

Note that this is not a posterior model probability in the usual sense in formal Baycsian model 
comparison, but simply represents the uncertainty of the sparsity structure in the conditional 
posterior mode estimate induced by the uncertainty in the posterior distribution on the smooth- 
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Tabic 7: Body fat example: 10 models with highest posterior model probability 



ing parameter. From the Gibbs samples of A, it is straightforward to estimate these PMPs. Table 
7 presents 10 models with highest PMP which indicates high model uncertainty. The model with 
highest posterior probability and these 10 mostly selected models account for only 2.23% and 
16.8% of the total posterior model probability, respectively. With this model uncertainty, using 
a single model for prediction may be risky. 

We now examine the predictive performance of the approaches. To this end, we split the 
dataset (without standardizing) into two parts: the first 150 observations are used as the training 
set, the remaining observations are used as the prediction set. The out-of-sample predictive 
squared errors (PSEs) of aLasso, BaLasso-Mean, BaLasso-Median, BaLasso-EB, BLasso and 
BaLasso-BMA are 18.92, 18.28, 19.79, 19.00, 18.69, 18.13, respectively. Thus, for this dataset, 
BaLasso-BMA has the best predictive performance. 

Example 6: Prostate cancer data. Stamey et al. (1989) studied the correlation between the 
level of prostate antigen (Ipsa) and a number of clinical measures in men: log cancer volume 
(Icavot), log prostate weight (Iweight), age, log of the amount of benign prostatic hyperplasia 
(Ibph), seminal vesicle invasion (svi), log of capsular penetration (lep), Gleason score (gleason), 
and percentage of Gleason scores 4 or 5 (pgg45). We assume a linear regression model between 
the response Ipsa and the 8 covariates. We first consider the variable selection problem. The 
data set of size 97 is standardized so that the intercept /3q is excluded. Table 8 summarizes 
the selected smoothing parameters and estimated coefficients by various methods. Note that, 
for Lasso and aLasso there is just one smoothing parameter and putting the values on the first 
row as presented in the table does not mean these parameters are only associated with the first 
predictor. 





Selected A 








Coefficient estimate /3 




B aLasso 


B aLasso 


B aLasso 


Lasso 


aLasso 


BaLasso 


BaLasso 


BaLasso 


Lasso 


aLasso 


-EB 


-Median 


-Mean 






-EB 


-Median 


-Mean 






1.24 


1.19 


1.39 


2.40 


1.86 


0.563 


0.562 


.563 


.561 


.568 


1.59 


1.50 


1.76 






0.436 


0.436 


.436 


.357 


.437 


332.75 


841.05 


1066 















-.015 





55.78 


16.67 


20.41 















.1 





1.15 


1.08 


1.27 






0.587 


0.594 


.580 


.432 


.510 


97.61 


86.56 


113.2 





















89.77 


78.69 


105.12 





















754.38 


1241.70 


1823.7 















.005 






Table 8: Prostate cancer example: selected smoothing parameters and coefficient estimates 



11 



Models 


D1\,TD /0/\ 

JrJVLr (7q) 


1 2 




5 


27.9 


1 2 




5 8 


16.1 




4 


5 


6.3 




4 





5.9 


1 2 




8 


^ 7 


1 2 


4 


5 


5.1 


1 2 


3 


5 8 


4.9 


1 2 


3 4 


5 8 


4.9 




4 


5 8 


3.2 


1 2 






3.1 



Table 9: Prostate cancer example: 10 models with highest posterior model probability 

The EB estimation here is implemented using the stabilized Algorithm 2.2 of Atchadc (2009), 
in which the compact sets are selected to be <g>[— n— l,n + l], and the step-size a n — 2/n is 
obtained after a few trials by justifying the convergence of iterations A*-™-* graphically. As shown 
in Table 8, BaLasso-EB, BaLasso-Mean and BaLasso-Median give very similar estimates for 
corresponding to nonzero coefficients, but fairly different estimates for Xj corresponding to zero 
coefficients. The effects of increased penalty parameters on the zero coefficients are obvious: 
smaller shrinkage is applied to the nonzero coefficients and larger shrinkage is applied to those 
which should be removed. 

The adaptive Lasso and all of the proposed strategies (including BaLasso-Freq also) for 
variable selection produce the same model whose BIC is -25.19, while BIC of the model selected 
by Lasso is -21.38. Therefore the model chosen by our methods is favorable. 

Table 9 presents 10 models with highest PMP. The mostly selected model is the same as the 
one selected by aLasso and our methods. In comparison to the previous example, the presence 
of model uncertainty is not very clear in this case. The model with highest posterior probability 
accounts for 27.9% of the total which is considerably large. Moreover, this probability is also 
considerably different from that of the model with second highest posterior probability. 

To examine the predictive performance, we split the data set (without standardizing) into two 
sets: the first 50 observations form the training set D T , the rest form the prediction set D p . The 
PSEs of aLasso, BLasso, BaLasso-Median, BaLasso-BMA are 1.89, 1.91, 1.91, 1.86 respectively. 
Therefore, although the presence of model uncertainty is not very clear, BaLasso-BMA still 
provides comparable and slightly better estimates in terms of prediction. 



5 A Unified Framework 

So far, we have focused on BaLasso for linear regression. This section extends the BaLasso to 
more complex models such as generalized linear models, Cox's models and so on, with other 
penalties, such as the group penalty (Yuan and Lin, 2006) and the composite absolute penalty 
(Zhao, Rocha and Yu, 2009). This unified framework enables us to study variable selection in a 
much broader context. 

Denote by L((3) the minus log-likelihood. In order to use the BaLasso developed for linear 
regression, we approximate L{(3) by the least squares approximation (LSA) in Wang and Leng 
(2007) 



- ^>+^(/M)^(/M)'^(/M) 



= constant + i(/3-/3)lr 1 (/3-/3) 
where (3 is the MLE of (3 and S _1 :=d 2 L((3)/d(3 2 . To use the BaLasso for a general model, the 
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sampling distribution of y, conditional on /?, can be approximately written as 

And we only need to update the hierarchical model for y in the linear model using this expression 
while keeping other specifications intact. Now we discuss in detail three novel applications of 
BaLasso for models with flexible penalties. 

BaLasso with LSA. The frequentist adaptive Lasso for general models estimates /3 by mini- 
mizing 

L((3)+J2^Ml (11) 

Its Bayesian version is the following 

y\p ~ exp^-^-^'S- 1 ^-^) 
I3\t 2 ~ N p (0,D T ), D T = diag(r 2 ), 
r 2| A 2 „ TT Me-AjrjV*, 

A 2 ~ fl^Y-'e-^ 

3 = 1 

where t 2 := (t 2 ,...,t 2 )', A 2 := (A 2 ,...,A 2 )'. Note that we no longer have cr 2 in the hierarchy. The 
full conditionals are specified by 

/%,t 2 ,a 2 ~ Ar p ((s- 1 + J D- 1 )- 1 i:- 1 /3,(i:- 1 + i)- 1 )- 1 ), 

^2 = Tj'Ij/, A 2 - inverse-Gaussian ^ xfj , j = 1, 

r 2 

Aj|y,/3,r 2 - gamma(r + 1,5+ -|-), j = 

BaLasso for group Lasso. The adaptive group Lasso (Yuan and Lin, 2006) for general models 
minimizes 

J 

m+Ys^mik (12) 

3 = 1 

where /3j is the coefficient vector of the jth group, j — 1,...,J. The corresponding Bayesian 
hierarchy is as follows: 

2 



~ ^(o.t/i^), i = i,...,J 

2 1 \2 / TO," + 1 A 2 \ 

r/|A 2 - gamma — , ] , j = 1, J 



2 ' 2 

A 2 - gamma(r, S), j = 1, J 

where rn, is the size of group j, l mj is the identity matrix of order rnj. This prior was also used 
by Kyung et al. (2009) for grouped variable selection in linear regression. 
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The full conditionals can be obtained as follows. Let X be the square root matrix of E 1 
and y:~X/3. Write X = [X\,...,Xj] with block matrices Xj of size p x m,j . We have 

&|y,/3_„T 2 ,A 2 ~ N mj (Aj 1 X' j (y-^X j .l3 r ),Aj i y 



—9 = 7j \y> ft ^ 2 ~ inverse Gaussian [ ^ 

t/ vjlft" 



7j|t/,/3, A 2 <~ inverse Gaussian ( n q J n , A 2 J , 
A 2 ||/,£,t 2 - gamma jr+ mj 2 +1 ,(5+^- J , j = l,...,J, 



where P- j = {0 1 ,...,P j - 1 ,P j+1 ,...,0 J ) and ^ =XjX j + (l/r J 2 )]L m . . 

BaLasso for composite absolute penalty. We now consider the group selection problem in 
which a natural ordering among the groups is present. By j—tj', we mean that group j should 
be added into the model before another group j', i.e., if group j' is selected then group j must 
be included in the model as well. We extend the composite absolute penalty (Zhao, Rocha and 
Yu, 2009) by allowing different tuning parameters for different groups 

X! <\jll(ftAll j':j->j')\\l2, 
group j 

where fij is a coefficient vector and this penalty represents some hierarchical structure in the 
model. From this, the desired prior for j3 is the multi-Laplace 



^(/?)cXCXp \ Y; X MPhPj':j^')\\ 



which can be expressed as the following normal-gamma mixture 



__ j cxp ( i-j j exp (--^)dTj=e X p(\ j \\(0 j ,P j , :j ^ j , 



(13) 

where kj : = Wj+X}j'-j->j' m .?'- Similar to the Bayesian formulations before, this identity leads to 
the idea of using a hierarchical Bayesian formulation with a normal prior for /3\t 2 and a gamma 
prior for r 2 . More specifically, the prior for /3|r 2 will be 



/3\t 2 cx exp - ^2 



2r 2 



3 



This suggests that the hierarchical prior for ft|T 2 is independently normal with mean and 
covariance matrix (l/T^ + ^/.y^l/r 2 ,) -1 !^., j = l,...,J. We therefore have the following 
hierarchy 

y\P ~ e x V (-l(p-py£-\p-p)\, 



T * — ' T , ' 

3 3 



fr\r 2 ~ JV m . (0,a 2 l roj .), where a 2 := + ]T -i-)" 1 

r/|A 2 ~ gamma U—,^ 

A 2 ~ gamma(r, 5) for j = l,...,J. 
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n Lasso aLasso BaLasso 



200 3(2.15) 35(3.97) 36(6.19) 
300 5(2.42) 42(4.07) 90(5.10) 
500 4(2.66) 41(4.00) 100(5.00) 



Table 10: Example 1: Frequency of correctly-fitted models over 100 replications. The numbers 
in parentheses are average numbers of zero-coefficients estimated. The oracle average number is 
5. 

Full conditionals. It is now straightforward to derive the full conditionals as follows 

ftl^ft^r 2 ^ 2 

j 



N mj Uj 1 X' j (y-^2x r l3 r ),Aj 1 , 

V O'ft j 

inverse Gaussian ( — — — ■— , A 2 ) , 

\\\(Pj,Pj':j^j')\\ V 
( k 3 + 1 r T f\ ■ , 

gamma I r + — , 5 + -± I , j = 1, J 



where ft,- = (ft ,...,ft_ l5 ft +1 ,...,ft) and Aj=X' j Xj + (l/a 2 )TL mr 

We now assess the usefulness of this unified framework by three examples. For brevity, we 
only report the performance of various methods in terms of model selection. 

Example 7: BaLasso in logistic regression. We simulate independent observations from 
Bernoulli distributions with probabilities of success 

r>/ ii o\ exp(5+x</3) 



l+exp(5+x^/3) 



where /3 = (3, 1.5, 0, 0, 2, 0, 0, 0)', and x, = {x n ,...,x ip ) ~ A p (0,£) with o^- = 0.5^1. We 
compare the performance of the BaLasso to that of the Lasso and the aLasso. The performance 
is measured by the frequency of correct fitting and average number of zero coefficients over 100 
replications. The weight vector in aLasso is as usual assigned as w = 1/|/3 (0) |, where /?(°) is the 
MLE. The shrinkage parameters in Lasso and aLasso are tuned by 5-fold cross-validation. Table 
10 presents the simulation result for various sample size n. The aLasso in this example works 
better than the Lasso. The suggested BaLasso works very well, especially when the sample size 
n is large. In addition, the BaLasso often produces sparser models than the others do. 

Example 8: BaLasso for group selection. We consider in this example the group selection 
problem in a linear regression framework. We follow the simulation setup of Yuan and Lin (2006). 
A vector of 15 latent variables Z^iVi 5 (0,S) with (7^ =0.5'* — J l are first simulated. For each latent 
variable a 3-level factor Fj is determined according to whether Zj is smaller than $ _1 (l/3), 
larger than $ _1 (2/3) or in between. The factor Fi then is coded by two dummy variables. There 
are totally 30 dummy variables X\,...,X3o and 15 groups with f3j = (02j-i,02j)' , j = 1,...,J= 15. 
After having the design matrix A, a vector of responses is generated from the following linear 
model 

y = A/3 + e, e ~ N n (0, 1) (14) 

where most of ^ = except ft = (-1.2, 1.8)', /3 3 = (1, 0.5)', ft = (1, 1)'. We compare the 
performance of the BaLasso to that of the gLasso in Yuan and Lin (2006) and the adaptive 
group Lasso (agLasso, Wang and Leng, 2008) in terms of frequencies of correct fitting and 
average numbers of not-selected factors over 100 replications. We follow Wang and Leng (2008) 
to take the weights Wj = l/||ft MLE || with ft MLE are the MLE of ft. The tuning parameters in 
gLasso and agLasso are tuned using AIC with the degrees of freedom as in Yuan and Lin (2006). 
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n gLasso agLasso BaLasso 



100 5(6.64) 22(9.60) 15(14.86) 
200 8(6.92) 48(10.72) 90(12.04) 
500 7(7.24) 70(11.34) 100(12.00) 



Table 11: Example 8: Frequency of correctly-fitted models and average numbers (in parentheses) 
of not-selected factors over 100 replications. The oracle average number is 12. 

n gLasso agLasso BaLasso 

~700 18(4.25) 45(5.45) 72(7.28) 

200 36(5.16) 88(6.78) 100(7.00) 

500 34(5.24) 96(6.92) 100(7.00) 



Table 12: Example 9: Frequency of correctly-fitted models and average numbers (in parentheses) 
of not-selected effects over 100 replications. The oracle average number is 7. 

We use 1000 values of A equally spaced from to A max to search for the optimal value. Table 
11 reports the simulation result. Both gLasso and agLasso seem to select unnecessarily large 
models and have low rate of correct fitting. In contrast, the BaLasso seems to produce more 
parsimonious models when n is small. In general, the BaLasso works much better than the 
others in terms of model selection consistency. 

Example 9: BaLasso for main and interaction effect selection. In this example we 
demonstrate the BaLasso with composite absolute penalty for selecting main and interaction 
effects in a linear framework. We consider the model II of Yuan and Lin (2006). First, 4 factors 
are created as in the previous example, each factor is then coded by two dummy variables. The 
true model is generated from (14) with main effects j3i = (3, 2)', /?2 = (3, 2)' and interaction 
/?i-2 = (l, 1-5, 2, 2.5)'. There are totally 10 groups (4 main effects and 6 second-order interaction 
effects) with the natural ordering in which main effects should be selected before their corre- 
sponding interaction effects. We use the BaLasso formulation with composite absolute penalty 
to account for this ordering. Table 12 reports the simulation results. We observe that both 
gLasso and agLasso sometimes select effects in a "wrong" order (interactions are seclected while 
the corresponding main effects are not). As a result, they have low rates of correct fitting. 
The BaLasso always produce the models with effects in the "right" order. This fact has been 
theoretically proven in Zhao, Rocha and Yu (2009). In general, the BaLasso outperforms its 
competitors. 

6 Conclusion 

We have proposed the Bayesian adaptive Lasso which is novel in two aspects. First, we use 
an adaptive penalty and have proposed methods for tuning parameter selection and estima- 
tion. Second, we have proposed to use the posterior mode of the regression coefficients given 
the shrinkage parameters from their posterior for model averaging. Our approach retains the 
attractiveness of the usual Lasso in producing sparse models, and that of the aLasso in giving 
consistent models. Moreover, due to its Bayesian nature, an ensemble of sparse models, pro- 
duced as the posterior modes estimates, can be used for model averaging. Thus, our approach 
provides a novel and natural treatment of exploration of model uncertainty and predictive in- 
ference. Finally, we have proposed a unified framework which can be applied to select groups of 
variables (Yuan and Lin, 2006) and other constrained penalties (Zhao, Rocha and Yu, 2009) in 
more general models. Empirically, we have shown its attractiveness compared to its competitors. 
The software implementing our method is freely available from the authors' homepage. 
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